Ranking the 6 Most Accurate Keyword Difficulty Tools

Ali JalilPour June 27, 2024

In January of 2018, Brafton began a massive organic keyword targeting campaign, publishing more than 90,000 words of blog content.

Did it work?

Well, yeah. We doubled the total number of keywords we rank for in less than six months. By using our advanced keyword research and topic writing process, published earlier this year, we also increased our organic traffic by 45% and the number of keywords ranking in the top ten results by 130%.

But we got a whole lot more than just traffic.

From planning to execution and performance tracking, we meticulously logged every aspect of the project. I’m talking blog word count, MarketMuse performance scores, on-page SEO scores, days indexed on Google. You name it, we recorded it.

As a byproduct of this nerdery, we were able to draw juicy correlations between our target keyword rankings and variables that can affect and predict those rankings. But specifically for this piece…

How well keyword research tools can predict where you will rank.

A little background

We created a list of keywords we wanted to target in blogs based on optimal combinations of search volume, organic keyword difficulty scores, SERP crowding, and searcher intent.

We then wrote a blog post targeting each individual keyword. We intended for each new piece of blog content to rank for the target keyword on its own.

With our keyword list in hand, my colleague and I manually created content briefs explaining how we would like each blog post written to maximize the likelihood of ranking for the target keyword. Here’s an example of a typical brief we would give to a writer:

This image links to an example of a content brief Brafton delivers to writers.

Between mid-January and late May, we ended up writing 55 blog posts, each targeting its own unique keyword. Of those, 50 ended up ranking in the top 100 Google results.

We then paused and took a snapshot of each URL’s Google ranking position for its target keyword and its corresponding organic difficulty scores from Moz, SEMrush, Ahrefs, SpyFu, and KW Finder. We also took the PPC competition scores from the Keyword Planner Tool.

Our intention was to draw statistical correlations between our keyword rankings and each tool’s organic difficulty score. With this data, we were able to report on how accurately each tool predicted where we would rank.

This study is uniquely scientific, in that each blog had one specific keyword target. We optimized the blog content specifically for that keyword. Therefore every post was created in a similar fashion.

Do keyword research tools actually work?

We use them every day, on faith. But has anyone ever actually asked, or better yet, measured how well keyword research tools report on the organic difficulty of a given keyword?

Today, we are doing just that. So let’s cut through the chit-chat and get to the results…

This image ranks the six keyword research tools in order. Moz leads with 4.95 stars out of 5, followed by KW Finder, SEMrush, Ahrefs, SpyFu, and lastly Keyword Planner Tool.

While Moz wins top-performing keyword research tool, note that any keyword research tool with organic difficulty functionality will give you an advantage over flipping a coin (or using Google Keyword Planner Tool).

As you will see in the following paragraphs, we have run each tool through a battery of statistical tests to ensure that we painted a fair and accurate representation of its performance. I’ll even provide the raw data for you to inspect for yourself.

Let’s dig in!

The Pearson Correlation Coefficient

Yes, statistics! For those of you currently feeling panicked and lobbing obscenities at your screen, don’t worry — we’re going to walk through this together.

In order to understand the relationship between two variables, our first step is to create a scatter plot chart.

Below is the scatter plot for our 50 keyword rankings compared to their corresponding Moz organic difficulty scores.

This image shows a scatter plot for Moz's keyword difficulty scores versus our keyword rankings. In general, the data clusters fairly tight around the regression line.

We start with a visual inspection of the data to determine if there is a linear relationship between the two variables. Ideally for each tool, you would expect to see the X variable (keyword ranking) increase proportionately with the Y variable (organic difficulty). Put simply, if the tool is working, the higher the keyword difficulty, the less likely you will rank in a top position, and vice-versa.

This chart is all fine and dandy, however, it’s not very scientific. This is where the Pearson Correlation Coefficient (PCC) comes into play.

The PCC measures the strength of a linear relationship between two variables. The output of the PCC is a score ranging from +1 to -1. A score greater than zero indicates a positive relationship; as one variable increases, the other increases as well. A score less than zero indicates a negative relationship; as one variable increases, the other decreases. Both scenarios indicate some degree of association between the two variables (though correlation alone does not prove causation). The stronger the relationship between the two variables, the closer to +1 or -1 the PCC will be. Scores near zero indicate a weak relationship or none at all.
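To make the definition concrete, here is a minimal sketch of how the PCC can be computed by hand. The function name `pearson_r` and the sample numbers are illustrative assumptions, not the study's data or the exact tooling we used.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Co-variation of the two variables around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data (NOT the study's): Google positions and difficulty scores.
rankings = [3, 8, 15, 22, 40, 61]
difficulty = [21, 30, 28, 41, 39, 55]

r = pearson_r(rankings, difficulty)
print(round(r, 3))  # a positive value below 1 for this made-up sample
```

A perfectly straight upward line returns +1, a perfectly straight downward line returns -1, and everything in between reflects how tightly the points hug the regression line.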

Phew. Still with me?

So each of these scatter plots will have a corresponding PCC score that will tell us how well each tool predicted where we would rank, based on its keyword difficulty score.

We will use the following table from statisticshowto.com to interpret the PCC score for each tool:

Correlation Coefficient (r)	Interpretation
+.70 or higher	Very strong positive relationship
+.40 to +.69	Strong positive relationship
+.30 to +.39	Moderate positive relationship
+.20 to +.29	Weak positive relationship
+.01 to +.19	No or negligible relationship
0	No relationship [zero correlation]
-.01 to -.19	No or negligible relationship
-.20 to -.29	Weak negative relationship
-.30 to -.39	Moderate negative relationship
-.40 to -.69	Strong negative relationship
-.70 or lower	Very strong negative relationship
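For readers who prefer code to lookup tables, the bands above can be sketched as a small helper. The function name `interpret_r` is hypothetical, and the handling of values that fall between two bands (for example .395) is our assumption: they are assigned to the weaker band.

```python
def interpret_r(r):
    """Map a correlation coefficient to the labels in the table above."""
    if r == 0:
        return "No relationship"
    strength = abs(r)
    if strength >= 0.70:
        label = "Very strong"
    elif strength >= 0.40:
        label = "Strong"
    elif strength >= 0.30:
        label = "Moderate"
    elif strength >= 0.20:
        label = "Weak"
    else:
        # Anything under .20 in either direction is treated as negligible.
        return "No or negligible relationship"
    direction = "positive" if r > 0 else "negative"
    return f"{label} {direction} relationship"

print(interpret_r(0.412))  # Strong positive relationship
print(interpret_r(0.045))  # No or negligible relationship
```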

In order to visually understand what some of these relationships would look like on a scatter plot, check out these sample charts from Laerd Statistics.

These scatter plots show three types of correlations: positive, negative, and no correlation. Positive correlations have data plots that move up and to the right. Negative correlations move down and to the right. No correlation has data that follows no linear pattern

And here are some examples of charts with their correlating PCC scores (r):

These scatter plots show what different PCC values look like visually. The tighter the grouping of data around the regression line, the higher the PCC value.

The closer the numbers cluster towards the regression line in either a positive or negative slope, the stronger the relationship.

That was the tough part – you still with me? Great, now let’s look at each tool’s results.

Test 1: The Pearson Correlation Coefficient

Now that we’ve all had our statistics refresher course, we will take a look at the results, in order of performance. We will evaluate each tool’s PCC score, the statistical significance of the data (P-val), the strength of the relationship, and the percentage of keywords the tool was able to find and report keyword difficulty values for.
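A quick note on where a P-val comes from: one common approach (an assumption here, since the exact software is not part of this write-up) is to convert the PCC and the sample size into a t-statistic and compare it against a t-distribution with n - 2 degrees of freedom. A minimal sketch, using the Moz figures reported in this section:

```python
import math

def t_statistic(r, n):
    """t-statistic for testing whether a correlation r from n points differs from zero."""
    if n < 3:
        raise ValueError("need at least 3 data points")
    if abs(r) >= 1:
        raise ValueError("t is undefined for a perfect correlation")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Moz: PCC of 0.412 across 50 keywords.
t_moz = t_statistic(0.412, 50)
print(round(t_moz, 2))

# For 48 degrees of freedom, the two-tailed 5% critical value from a standard
# t-table is roughly 2.01, so a t-statistic above that corresponds to P < 0.05.
```

The larger the t-statistic, the smaller the P-val, which is why a modest correlation can still be statistically significant with 50 data points.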

In order of performance:

#1: Moz

Revisiting Moz’s scatter plot, we observe a tight grouping of results relative to the regression line with few moderate outliers.

Moz Organic Difficulty Predictability
PCC	0.412
P-val	.003 (P<0.05)
Relationship	Strong
% Keywords Matched	100.00%

Moz came in first with the highest PCC of .412. As an added bonus, Moz grabs data on keyword difficulty in real time, rather than from a fixed database. This means that you can get a keyword difficulty score for any keyword.

In other words, Moz was able to generate keyword difficulty scores for 100% of the 50 keywords studied.

#2: SpyFu

This image shows a scatter plot for SpyFu's keyword difficulty scores versus our keyword rankings. The plot is similar looking to Moz's, with a few larger outliers.

Visually, SpyFu shows a fairly tight clustering amongst low difficulty keywords, and a couple moderate outliers amongst the higher difficulty keywords.

SpyFu Organic Difficulty Predictability
PCC	0.405
P-val	.01 (P<0.05)
Relationship	Strong
% Keywords Matched	80.00%

SpyFu came in right under Moz with 1.7% weaker PCC (.405). However, the tool ran into the largest issue with keyword matching, with only 40 of 50 keywords producing keyword difficulty scores.

#3: SEMrush

This image shows a scatter plot for SEMrush's keyword difficulty scores versus our keyword rankings. The data has a significant amount of outliers relative to the regression line.

SEMrush would certainly benefit from a couple mulligans (a second chance to perform an action). The Correlation Coefficient is very sensitive to outliers, which pushed SEMrush’s score down to third (.364).

SEMrush Organic Difficulty Predictability
PCC	0.364
P-val	.01 (P<0.05)
Relationship	Moderate
% Keywords Matched	92.00%

Further complicating the research process, only 46 of 50 keywords had keyword difficulty scores associated with them, and many of those had to be found through SEMrush’s “phrase match” feature individually, rather than through the difficulty tool.

Having to dig around for the data made the process more laborious.

#4: KW Finder

This image shows a scatter plot for KW Finder's keyword difficulty scores versus our keyword rankings. The data also has a significant amount of outliers relative to the regression line.

KW Finder definitely could have benefitted from more than a few mulligans with numerous strong outliers, coming in right behind SEMrush with a score of .360.

KW Finder Organic Difficulty Predictability
PCC	0.360
P-val	.01 (P<0.05)
Relationship	Moderate
% Keywords Matched	100.00%

Fortunately, the KW Finder tool had a 100% match rate without any trouble digging around for the data.

#5: Ahrefs

This image shows a scatter plot for Ahrefs' keyword difficulty scores versus our keyword rankings. The data shows tight clustering amongst low difficulty score keywords, and a wide distribution amongst higher difficulty scores.

Ahrefs comes in fifth by a large margin at .316, barely clearing the “weak relationship” range.

Ahrefs Organic Difficulty Predictability
PCC	0.316
P-val	.03 (P<0.05)
Relationship	Moderate
% Keywords Matched	100%

On a positive note, the tool seems to be very reliable with low difficulty scores (notice the tight clustering for low difficulty scores), and matched all 50 keywords.

#6: Google Keyword Planner Tool

This image shows a scatter plot for Google Keyword Planner Tool's keyword difficulty scores versus our keyword rankings. The data shows randomly distributed plots with no linear relationship.

Before you ask, yes, SEO companies still use the paid competition figures from Google’s Keyword Planner Tool (and other tools) to assess organic ranking potential. As you can see from the scatter plot, there is in fact no linear relationship between the two variables.

Google Keyword Planner Tool Organic Difficulty Predictability
PCC	0.045
P-val	Statistically insignificant/no linear relationship
Relationship	Negligible/None
% Keywords Matched	88.00%

SEO agencies still using KPT for organic research (you know who you are!) — let this serve as a warning: You need to evolve.

Test 1 summary

For scoring, we will use a ten-point scale and score every tool relative to the highest-scoring competitor. For example, if the second highest score is 98% of the highest score, the tool will receive a 9.8. As a reminder, here are the results from the PCC test:

This bar chart shows the final PCC values for the first test, summarized.

And the resulting scores are as follows:

Tool	PCC Test
Moz	10
SpyFu	9.8
SEMrush	8.8
KW Finder	8.7
Ahrefs	7.7
KPT	1.1
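The ten-point scale described above is simple enough to reproduce. This is a minimal sketch using the PCC values reported for each tool; the function name `relative_scores` is ours, purely for illustration.

```python
def relative_scores(values, scale=10):
    """Score each tool relative to the best performer, on a 0-to-scale range."""
    best = max(values.values())
    return {tool: round(value / best * scale, 1) for tool, value in values.items()}

# PCC values from Test 1.
pcc = {
    "Moz": 0.412,
    "SpyFu": 0.405,
    "SEMrush": 0.364,
    "KW Finder": 0.360,
    "Ahrefs": 0.316,
    "KPT": 0.045,
}

for tool, score in relative_scores(pcc).items():
    print(tool, score)
```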

Moz takes the top position for the first test, followed closely by SpyFu (with an 80% match rate caveat).

Test 2: Adjusted Pearson Correlation Coefficient

Let’s call this the “Mulligan Round.” In this round, assuming that sometimes things just go haywire and a tool flat-out misses, we will remove the three most egregious outliers from each tool’s data and recalculate its PCC.

Here are the adjusted results for the handicap round:

Adjusted Scores (3 Outliers removed)	PCC	Difference (+/-)
SpyFu	0.527	0.122
SEMrush	0.515	0.150
Moz	0.514	0.101
Ahrefs	0.478	0.162
KW Finder	0.470	0.110
Keyword Planner Tool	0.189	0.144

As noted in the original PCC test, some of these tools really took a big hit with major outliers. Specifically, Ahrefs and SEMrush benefitted the most from their outliers being removed, gaining .162 and .150 respectively to their scores, while Moz benefited the least from the adjustments.
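If you want to run a mulligan round on your own data, here is one way to sketch it. Note the assumption: this write-up does not spell out how an outlier is identified, so the sketch below defines it as a point with one of the largest vertical distances from the least-squares regression line. The data is made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def drop_worst_outliers(xs, ys, k=3):
    """Remove the k points farthest (vertically) from the least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    # Keep the n - k points with the smallest residuals, in original order.
    keep = sorted(sorted(range(n), key=lambda i: residuals[i])[: n - k])
    return [xs[i] for i in keep], [ys[i] for i in keep]

# Hypothetical data: ten well-behaved points plus three wild misses.
rankings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 3, 5, 8]
difficulty = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 40, 45, 0]

r_before = pearson_r(rankings, difficulty)
trimmed_x, trimmed_y = drop_worst_outliers(rankings, difficulty, k=3)
r_after = pearson_r(trimmed_x, trimmed_y)
print(round(r_before, 3), round(r_after, 3))
```

In this toy sample, three bad points are enough to drag an otherwise perfect correlation down to almost nothing, which is exactly why the Correlation Coefficient is described as sensitive to outliers.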

For those of you crying out, “But this is real life, you don’t get mulligans with SEO!”, never fear, we will make adjustments for reliability at the end.

Here are the updated scores at the end of round two:

Tool	PCC Test	Adjusted PCC	Total
SpyFu	9.8	10	19.8
Moz	10	9.7	19.7
SEMrush	8.8	9.8	18.6
KW Finder	8.7	8.9	17.6
Ahrefs	7.7	9.1	16.8
KPT	1.1	3.6	4.7

SpyFu takes the lead! Now let’s jump into the final round of statistical tests.

Test 3: Resampling

Because there has never been a study performed on keyword research tools at this scale, we wanted to ensure that we explored multiple ways of looking at the data.

Big thanks to Russ Jones, who put together an entirely different model that answers the question: “What is the likelihood that the keyword difficulty of two randomly selected keywords will correctly predict the relative position of rankings?”

He randomly selected 2 keywords from the list and their associated difficulty scores.

Let’s assume one tool says that the difficulties are 30 and 60, respectively. What is the likelihood that the article written for a score of 30 ranks higher than the article written on 60? Then, he performed the same test 1,000 times.

He also threw out examples where the two randomly selected keywords shared the same rankings, or data points were missing. Here was the outcome:

Resampling	% Guessed correctly
Moz	62.2%
Ahrefs	61.2%
SEMrush	60.3%
KW Finder	58.9%
SpyFu	54.3%
KPT	45.9%
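The pairwise test described above can be sketched roughly as follows. This is our own approximation, not Russ Jones's actual model: the function name, the fixed random seed, and the choice to also skip pairs with identical difficulty scores are all assumptions.

```python
import random

def pairwise_accuracy(records, trials=1000, seed=0):
    """Share of random keyword pairs where the lower difficulty also ranks better.

    records: list of (difficulty, ranking) tuples; difficulty may be None
    when the tool returned no score for that keyword.
    """
    rng = random.Random(seed)
    correct = valid = 0
    for _ in range(trials):
        (kd_a, rank_a), (kd_b, rank_b) = rng.sample(records, 2)
        # Throw out pairs with missing data or identical rankings.
        if kd_a is None or kd_b is None or rank_a == rank_b:
            continue
        # Assumption: identical difficulty scores make no prediction, so skip.
        if kd_a == kd_b:
            continue
        valid += 1
        # A correct guess: the easier keyword holds the better (lower) position.
        if (kd_a < kd_b) == (rank_a < rank_b):
            correct += 1
    return correct / valid if valid else float("nan")

# Hypothetical data: (difficulty score, Google position).
sample = [(22, 4), (35, 9), (41, 30), (48, 12), (None, 7), (60, 55)]
print(round(pairwise_accuracy(sample), 3))
```

A tool with no predictive power should land near 0.5 on this measure, which is why a coin flip is the natural baseline.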

As you can see, this test was particularly hard on each of the tools. No single tool is a silver bullet, so it is our job to see how much each tool helps us make more educated decisions than guessing.

Most tools stayed pretty consistent with their levels of performance from the previous tests, except SpyFu, which struggled mightily with this test.

In order to score this test, we need to use 50% as the baseline (equivalent of a coin flip, or zero points), and scale each tool relative to how much better it performed over a coin flip, with the top scorer receiving ten points.

For example, Ahrefs scored 11.2 percentage points better than a coin flip. That is 8.2% less than Moz, which beat the coin flip by 12.2 points, giving Ahrefs a score of 9.2.
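The scaling just described can be sketched as a short function. The name `baseline_scores` is hypothetical, and for brevity the example covers only the five organic difficulty tools.

```python
def baseline_scores(pct_correct, baseline=50.0, scale=10):
    """Scale each tool by how far it beats the baseline, with the best tool = scale."""
    best_margin = max(pct_correct.values()) - baseline
    return {
        tool: round((pct - baseline) / best_margin * scale, 1)
        for tool, pct in pct_correct.items()
    }

# Percent guessed correctly in the resampling test.
resampling = {
    "Moz": 62.2,
    "Ahrefs": 61.2,
    "SEMrush": 60.3,
    "KW Finder": 58.9,
    "SpyFu": 54.3,
}

for tool, score in baseline_scores(resampling).items():
    print(tool, score)
```

Using 50% rather than 0% as the floor matters: it rewards only the portion of a tool's accuracy that you could not get from guessing.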

The updated scores are as follows:

Tool	PCC Test	Adjusted PCC	Resampling	Total
Moz	10	9.7	10	29.7
SEMrush	8.8	9.8	8.4	27
Ahrefs	7.7	9.1	9.2	26
KW Finder	8.7	8.9	7.3	24.9
SpyFu	9.8	10	3.5	23.3
KPT	1.1	3.6	-4	.7

So after the last statistical accuracy test, Moz stands alone in the top tier, having performed consistently across all three tests. SEMrush, Ahrefs, and KW Finder all turn in respectable scores in the second tier. SpyFu is a unique case: it performed outstandingly in the first two tests (albeit only returning results for 80% of the tested keywords), then fell flat on the final test.

Finally, we need to make some usability adjustments.

Usability Adjustment 1: Keyword Matching

A keyword research tool doesn’t do you much good if it can’t provide results for the keywords you are researching. Plain and simple, we can’t treat two tools as equals if they don’t have the same level of practical functionality.

To explain in practical terms, if a tool doesn’t have data on a particular keyword, one of two things will happen:

You have to use another tool to get the data, which devalues the entire point of using the original tool.
You miss an opportunity to rank for a high-value keyword.

Neither scenario is good, therefore we developed a penalty system. For each 10% match rate under 100%, we deducted a single point from the final score, with a maximum deduction of 5 points. For example, if a tool matched 92% of the keywords, we would deduct .8 points from the final score.

One may argue that this penalty is actually too lenient considering the significance of the two undesirable scenarios outlined above.

The penalties are as follows:

Tool	Match Rate	Penalty
KW Finder	100%	0
Ahrefs	100%	0
Moz	100%	0
SEMrush	92%	-.8
Keyword Planner Tool	88%	-1.2
SpyFu	80%	-2
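The penalty rule is easy to express in code. A minimal sketch follows; the function name `match_rate_penalty` is ours, and it returns the deduction as a positive number of points.

```python
def match_rate_penalty(match_rate_pct, max_penalty=5.0):
    """One point deducted per 10% of keywords missed, capped at max_penalty."""
    shortfall = max(0.0, 100.0 - match_rate_pct)
    return round(min(max_penalty, shortfall / 10.0), 1)

for tool, rate in [("KW Finder", 100), ("SEMrush", 92), ("KPT", 88), ("SpyFu", 80)]:
    print(tool, match_rate_penalty(rate))
```

Under this rule the cap only kicks in for a tool that misses half or more of the keywords.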

Please note we gave SEMrush a lot of leniency, in that technically, many of the keywords evaluated were not found in its keyword difficulty tool, but rather through manually digging through the phrase match tool. We will give them a pass, but with a stern warning!

Usability Adjustment 2: Reliability

I told you we would come back to this! Revisiting the second test in which we threw away the three strongest outliers that negatively impacted each tool’s score, we will now make adjustments.

In real life, there are no mulligans. In real life, each of those three blog posts that were thrown out represented a significant monetary and time investment. Therefore, when a tool has a major blunder, the result can be a total waste of time and resources.

For that reason, we will impose a slight penalty on those tools that benefited the most from their handicap.

We will use the level of PCC improvement to evaluate how much a tool benefitted from removing their outliers. In doing so, we will be rewarding the tools that were the most consistently reliable. As a reminder, the amounts each tool benefitted were as follows:

Tool	Difference (+/-)
Ahrefs	0.162
SEMrush	0.150
Keyword Planner Tool	0.144
SpyFu	0.122
KW Finder	0.110
Moz	0.101

In calculating the penalty, we scored each of the tools relative to the top performer, giving the top performer zero penalty and imposing penalties based on how much additional benefit the tools received over the most reliable tool, on a scale of 0–100%, with a maximum deduction of 5 points.

So if a tool received twice the benefit of the top performing tool, it would have had a 100% benefit, receiving the maximum deduction of 5 points. If another tool received a 20% benefit over the most reliable tool, it would get a 1-point deduction. And so on.

Tool	% Benefit	Penalty
Ahrefs	60%	-3
SEMrush	48%	-2.4
Keyword Planner Tool	42%	-2.1
SpyFu	20%	-1
KW Finder	8%	-.4
Moz	–	0
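Here is a sketch of the reliability penalty using the PCC gains listed earlier. One assumption to flag: the percentages in the table appear to be truncated to whole numbers rather than rounded, so the sketch truncates as well. The function name `reliability_penalty` is illustrative.

```python
def reliability_penalty(gain, best_gain, max_penalty=5.0):
    """Return (% extra benefit over the most reliable tool, points deducted)."""
    # Assumption: whole-percent truncation, which matches the table above.
    benefit_pct = int((gain / best_gain - 1) * 100)
    benefit_pct = max(0, min(100, benefit_pct))
    return benefit_pct, round(benefit_pct * max_penalty / 100, 1)

# PCC improvement each tool received from having three outliers removed.
gains = {
    "Ahrefs": 0.162,
    "SEMrush": 0.150,
    "KPT": 0.144,
    "SpyFu": 0.122,
    "KW Finder": 0.110,
    "Moz": 0.101,
}

best = min(gains.values())  # the most reliable tool gained the least
for tool, gain in gains.items():
    print(tool, reliability_penalty(gain, best))
```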

Results

All told, our penalties were fairly mild, with a slight shuffling in the middle tier. The final scores are as follows:

Tool	Total Score	Stars (5 max)
Moz	29.7	4.95
KW Finder	24.5	4.08
SEMrush	23.8	3.97
Ahrefs	23.0	3.83
SpyFu	20.3	3.38
KPT	-2.6	0.00
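For anyone checking our math, the final tally can be sketched like this. The star conversion is inferred from the table rather than stated anywhere: we assume 30 possible points map onto 5 stars, with negative totals floored at zero.

```python
# Totals after the three statistical tests, and the two usability penalties.
totals = {"Moz": 29.7, "KW Finder": 24.9, "SEMrush": 27.0,
          "Ahrefs": 26.0, "SpyFu": 23.3, "KPT": 0.7}
match_penalty = {"Moz": 0, "KW Finder": 0, "SEMrush": 0.8,
                 "Ahrefs": 0, "SpyFu": 2, "KPT": 1.2}
reliability_penalty = {"Moz": 0, "KW Finder": 0.4, "SEMrush": 2.4,
                       "Ahrefs": 3, "SpyFu": 1, "KPT": 2.1}

def final_score(tool):
    """Test total minus both usability penalties."""
    return round(totals[tool] - match_penalty[tool] - reliability_penalty[tool], 1)

def stars(score, max_points=30, max_stars=5):
    """Assumed conversion: share of possible points times 5, never below zero."""
    return round(max(0.0, score) / max_points * max_stars, 2)

for tool in sorted(totals, key=final_score, reverse=True):
    print(tool, final_score(tool), stars(final_score(tool)))
```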

Conclusion

Using any organic keyword difficulty tool will give you an advantage over not doing so. While none of the tools are a crystal ball, providing perfect predictability, they will certainly give you an edge. Further, if you record enough data on your own blogs’ performance, you will get a clearer picture of the keyword difficulty scores you should target in order to rank on the first page.

For example, we know the following about how we should target keywords with each tool:

Tool	Average KD (ranking ≤ 10)	Average KD (ranking ≥ 11)
Moz	33.3	37.0
SpyFu	47.7	50.6
SEMrush	60.3	64.5
KW Finder	43.3	46.5
Ahrefs	11.9	23.6
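If you log your own posts, you can build the same kind of table for your site. A minimal sketch, assuming you have recorded a difficulty score and a current Google position for each target keyword (the function name and the sample numbers are hypothetical):

```python
def average_kd_by_page(records, cutoff=10):
    """Average difficulty for keywords ranking on page one versus everywhere else.

    records: list of (difficulty, ranking) tuples for a single tool.
    Returns (average for ranking <= cutoff, average for ranking > cutoff);
    a side with no keywords comes back as None.
    """
    first_page = [kd for kd, rank in records if rank <= cutoff]
    beyond = [kd for kd, rank in records if rank > cutoff]

    def mean(values):
        return round(sum(values) / len(values), 1) if values else None

    return mean(first_page), mean(beyond)

# Hypothetical log: (difficulty score, Google position).
my_posts = [(20, 3), (30, 7), (40, 12), (50, 45)]
print(average_kd_by_page(my_posts))  # (25.0, 45.0)
```

The gap between the two averages gives you a rough ceiling for the difficulty scores worth targeting with that tool.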

This is pretty powerful information! It’s either first page or bust, so we now know the threshold for each tool that we should set when selecting keywords.

Stay tuned, because we made a lot more correlations between word count, days live, total keywords ranking, and all kinds of other juicy stuff. Tune in again in early September for updates!

We hope you found this test useful, and feel free to reach out with any questions on our math!

Disclaimer: These results are estimates based on 50 ranking keywords from 50 blog posts and keyword research data pulled from a single moment in time. Search is a shifting landscape, and these results have certainly changed since the data was pulled. In other words, this is about as accurate as we can get from analyzing a moving target.